Housing Market Analysis in Madrid

Authors
Affiliation

Buchner, Julia

University of Lausanne

Reichle, Franziska

Sima, Stefano

Published

December 23, 2024

Abstract

This study analyzes the primary factors influencing rental and buying property prices in Madrid’s housing market, leveraging a dataset from Kaggle. The analysis involved extensive data cleaning to address missing values, standardize variables, and handle outliers, ensuring robust results. Exploratory data analysis revealed trends based on property type, size, number of rooms, energy certification, year of construction, location, and amenities like lifts and parking. Larger properties in central and northern neighborhoods command higher prices, while flats dominate the market due to their affordability and versatility. Luxury properties, such as penthouses, reflect a niche demand for exclusivity. Predictive modeling was conducted using Lasso Regression, Linear Regression, and Random Forest to forecast log-transformed rental and buying prices. Gaussian Mixture Models (GMM) were applied to classify rental prices into “Inexpensive” and “Expensive” categories, improving the prediction accuracy. Random Forest emerged as the best-performing model, achieving the lowest RMSE (0.1783596 for buying prices, 0.1300083 for rental prices) and the highest R² (0.9318762 and 0.9160775, respectively), demonstrating its ability to capture complex relationships in the data. Limitations included the dataset’s static nature and assumptions made for handling missing values. The results provide valuable insights for stakeholders, including investors and policymakers, and highlight the importance of location, property size, and amenities in determining housing prices. These findings establish a foundation for further research and predictive modeling in urban real estate markets.

1 Introduction

1.1 Project Goals

The primary goal of this project is to analyze the key factors influencing rental and buying property prices in Madrid, a city with a dynamic and competitive real estate landscape driven by rising urbanization and increasing demand. By conducting an exploratory data analysis, we aim to uncover insights into the characteristics and distributions of the data, focusing on factors such as property type, size, neighborhood, energy certification, and features like parking availability or garden access. This analysis also includes visualizing geographical trends across districts in Madrid to highlight patterns in house prices. To enhance the depth of the study, predictive modeling is employed to estimate property prices based on location and other attributes, allowing us to compare the performance of different models and determine the most effective approach for accurate predictions. This comprehensive analysis not only provides valuable insights for investors, policymakers, and residents but also serves as a reusable framework for analyzing real estate markets in other cities by adapting the code to different contexts.

1.3 Research Questions

  • Question 1: What are the primary factors that influence real estate prices in Madrid? This question analyses the correlation of the renting and/or buying price with other variables, such as house type, size and other characteristics of the house and property.
  • Question 2: How does the location within Madrid affect real estate prices? Are there specific neighbourhoods appreciating or depreciating the prices? By using the neighbourhood variable, we aim to map the properties to specific areas within Madrid and to identify price trends within those areas.
  • Question 3: What impact do secondary features of the house and property have on real estate prices? In comparison to the first question which focuses on the main features of the house – such as the type, size and number of rooms – this question focuses on additional characteristics – such as the type of energy certificate, availability of parking spaces or access to a garden – and their correlation with the real estate price.

2 Data

2.1 Sources

The dataset was sourced from Kaggle, a popular platform for data science competitions and sharing datasets. The specific dataset used is titled “Madrid Real Estate Market,” curated by Mirbek Toktogaraev. It contains detailed information about the real estate market in Madrid, including property characteristics, pricing, and location data. The dataset consists of listings from popular real estate portals of Madrid.

2.2 Description

The dataset consists of 21,742 rows and 58 columns, with a mix of numeric and categorical variables. The dataset is not regularly updated, meaning the analysis reflects conditions at the time of the data’s collection.

The key features can be grouped into several categories:

▫ sq_mt_built: Total built area (in square meters) of the property.
▫ n_rooms: Number of rooms in the property.
▫ n_bathrooms: Number of bathrooms in the property.
▫ floor: The floor number of the property.
▫ house_type_id: Type of the property (e.g., apartment, house).
▫ built_year: The year in which the property was constructed.

▫ neighborhood_id: The neighborhood in which the property is located.
▫ is_exact_address_hidden: A boolean indicating if the exact address is available or not.
Note: The columns for latitude and longitude (geographical coordinates) are present but completely null, so they cannot be used for geospatial analysis without additional data sources.

▫ rent_price: The rental price of the property (in euros).
▫ buy_price: The purchase price of the property (in euros).
▫ buy_price_by_area: Price per square meter for properties listed for sale (in euros/m²).

▫ has_parking: A boolean indicating whether the property has parking facilities.
▫ energy_certificate: The energy efficiency rating of the property.

2.3 Wrangling/Cleaning

To prepare the dataset for analysis, we implemented a comprehensive cleaning process to ensure data consistency and usability. This involved removing redundant and irrelevant columns, handling missing and erroneous values, standardizing categorical and numerical variables, and addressing outliers. Key steps included dropping columns with minimal analytical value, filling binary feature columns with logical FALSE values for missing data, converting text-based floor levels to numeric, and standardizing neighborhood names and house types. Additionally, we addressed outliers through visual inspection and corrected implausible values, such as an erroneous built_year.

More precisely, the steps included:

  1. Removing Redundant and Irrelevant Columns:

    • Dropped several columns not useful for analysis, including title, subtitle, raw_address, operation (contained only “sale”), and others with minimal impact on the analysis.
  2. Handling Missing Values:

    • Columns with all NA values were removed.
    • Filtered out rows with missing values in critical columns, such as built_year and floor.
    • Replaced NA values with FALSE in specific binary variables, including has_ac, has_garden, has_pool, has_terrace, has_balcony, has_storage_room, and is_accessible.
      This decision was based on the observation that only TRUE values were present in these columns, likely indicating that the platform allowed users to specify features they had (by selecting TRUE) but not features they lacked (by selecting FALSE). Therefore, any NA values were logically assumed to mean the absence of the feature.
  3. Handling negative values in rent_price

    • Columns with all negative values were removed as negative rent_prices do not make sense.
  4. Remove specific columns that are redundant or not useful.

  5. Remove observations where built_year or floor is NA.

  6. Fill in the NA values for certain columns with FALSE, as mentioned in point 2.

  7. Floor Level Conversion:

    • Cleaned the floor variable by converting text-based descriptions to numeric values:
      • Converted “Bajo” (ground floor) to 0, “Sótano” (basement) to -1, “Semi-sótano” (semi-basement) to -0.5, and “Entreplanta” (mezzanine) to 0.5.
    • Converted the floor column to numeric for analysis.
  8. Modify values in house_type_id:

    • Converted columns neighborhood_id, house_type_id, and energy_certificate to factors.
    • Standardized house_type_id values as follows:
      • HouseType 1: Pisos renamed to Flat
      • HouseType 4: Dúplex renamed to Duplex
      • HouseType 5: Áticos renamed to Penthouse
      • HouseType 2: Casa o chalet (not present in final cleaned data)
  9. Modify values in energy_certificate:

    • Modified values in energy_certificate to standardize the categories:
      • Added categories for energy ratings: A, B, C, D, E, F, G
      • Renamed “en trámite” to “in process”
      • Renamed “no indicado” to “not indicated”
      • Renamed “inmueble exento” to “exempt”
  10. Verify which rows still have NA values after the above cleaning steps and remove those.

  11. Cleaning Neighborhood Information:

    • Standardized neighborhood names in neighborhood_id.
  12. Cleaning Neighborhood Information:

    • Removing numerical identifiers and irrelevant text to make data handling easier.
  13. Outlier Detection and Adjustment:

    • Created boxplots to visually inspect outliers in numeric variables, leading to corrections like changing a built_year value from 8170 to 1870.

    • Removed observations with negative values in rent_price, as these were likely erroneous.

  14. Remove has_garden column:

    After all the cleaning steps, we realized that there are only FALSE values in the has_garden column and all the TRUE values have been deleted due to other cleaning steps. This is why we decided to remove the has_garden column.

Code
# Load dataset and remove the first row because it contains a header
houses_Madrid <- suppressMessages(
  read_excel(here::here("data/processed/houses_Madrid.xlsx"), skip = 1)
)
data <- houses_Madrid

# Convert specific columns to factors (categorical variables)
data <- data %>%
  mutate(
    neighborhood_id = as.factor(neighborhood_id),
    house_type_id = as.factor(house_type_id),
    energy_certificate = as.factor(energy_certificate)
  )

# A. Introducing a cleaned dataset with the following changes

# 1. Remove the first column because it is redundant
data_cleaned <- data[, -1]

# 2. Remove columns that have all NA values except the header
data_cleaned <- data_cleaned[, colSums(!is.na(data_cleaned)) > 0]

# 3. Remove observations with a negative rent_price
data_cleaned <- data_cleaned[data_cleaned$rent_price >= 0, ]

# 4. Remove specific columns that are redundant or not useful
data_cleaned <- data_cleaned %>%
  select(
    -title,                # Covered by neighborhood_id
    -subtitle,             # Covered by neighborhood_id
    -raw_address,          # Covered by neighborhood_id
    -street_name,          # Covered by neighborhood_id
    -street_number,        # Covered by neighborhood_id
    -n_floors,             # Less important than floors data
    -is_floor_under,       # Redundant with floors data
    -sq_mt_useful,         # Less important than sq_mt_built
    -sq_mt_allotment,      # Less important than sq_mt_built
    -is_rent_price_known,  # All values are FALSE
    -is_buy_price_known,   # All values are TRUE
    -operation,            # All values are "sale"
    -buy_price_by_area,    # rent_price_by_area is not known, removal to facilitate comparison
    -is_new_development,   # Redundant with built_year
    -has_central_heating,  # Less important
    -has_individual_heating, # Less important
    -has_fitted_wardrobes, # Less important
    -is_exterior,          # Less important
    -has_green_zones,      # Less important
    -is_parking_included_in_price, # Less important
    -parking_price,        # Less important
    -is_orientation_north, # Less important
    -is_orientation_west,  # Less important
    -is_orientation_south, # Less important
    -is_orientation_east   # Less important
  )

# 5. Remove observations where 'built_year' or 'floor' is NA
data_cleaned <- data_cleaned %>%
  filter(!is.na(built_year), !is.na(floor))

# 6. Fill in NA values for certain columns with "FALSE"
data_cleaned <- data_cleaned %>%
  mutate(
    has_ac = ifelse(is.na(has_ac), FALSE, has_ac),
    has_garden = ifelse(is.na(has_garden), FALSE, has_garden),
    has_pool = ifelse(is.na(has_pool), FALSE, has_pool),
    has_terrace = ifelse(is.na(has_terrace), FALSE, has_terrace),
    has_balcony = ifelse(is.na(has_balcony), FALSE, has_balcony),
    has_storage_room = ifelse(is.na(has_storage_room), FALSE, has_storage_room),
    is_accessible = ifelse(is.na(is_accessible), FALSE, is_accessible)
  )

# 7. In the column floor:
# 7.1 "bajo" is used to say groundfloor, we are changing that to simply zero
# 7.2 "Semi-sótano", "Semi-sótano exterior", "Semi-sótano interior" is used to say semi-basement, we are changing that to -0.5
# 7.3 "Sótano", "Sótano exterior", "Sótano interior" is used to say basement, we are changing that to -1
# 7.4 "Entreplanta", "Entreplanta exterior", "Entreplanta interior" is used to say mezzanine, we are changing that to 0.5

data_cleaned <- data_cleaned %>%
  mutate(
    floor = case_when(
      floor == "Bajo" ~ "0",
      floor %in% c("Semi-sótano", "Semi-sótano exterior", "Semi-sótano interior") ~ "-0.5",
      floor %in% c("Sótano", "Sótano exterior", "Sótano interior") ~ "-1",
      floor %in% c("Entreplanta", "Entreplanta exterior", "Entreplanta interior") ~ "0.5",
      TRUE ~ floor
    )
  ) %>%
  mutate(floor = as.numeric(floor))  # Convert to numeric for analysis

# 8. Rename values in house_type_id column
data_cleaned <- data_cleaned %>%
  mutate(
    house_type_id = trimws(house_type_id),  # Remove any leading/trailing whitespace
    house_type_id = case_when(
      house_type_id == "HouseType 1: Pisos" ~ "Flat",
      house_type_id == "HouseType 2: Casa o chalet" ~ "House",
      house_type_id == "HouseType 4: Dúplex" ~ "Duplex",
      house_type_id == "HouseType 5: Áticos" ~ "Penthouse",
      TRUE ~ house_type_id  # Keep other values unchanged
    )
  ) %>%
  filter(!is.na(house_type_id))  # Remove rows with NA in 'house_type_id'

# 9. Modify values in the 'energy_certificate' column
data_cleaned <- data_cleaned %>%
  mutate(energy_certificate = case_when(
    energy_certificate == "en trámite" ~ "in process",
    energy_certificate == "no indicado" ~ "not indicated",
    energy_certificate == "inmueble exento" ~ "exempt",
    TRUE ~ energy_certificate  # Keep original values for other cases
  ))

# 10. Check which rows still have NA values and remove those
data_cleaned <- data_cleaned %>%
  filter(!is.na(sq_mt_built), !is.na(n_bathrooms), !is.na(has_lift))

# 11. Clean the neighborhood_id column to keep only district names
data_cleaned <- data_cleaned %>%
  mutate(
    # Extract only the district name from neighborhood_id by removing everything before "District X: "
    neighborhood_id = str_replace(neighborhood_id, ".*District \\d+: ", ""),

# 12. Remove any leading or trailing whitespace in district names
    neighborhood_id = trimws(neighborhood_id)
  )

# 13. Handling outliers
# 13.1 Identify outliers using boxplots
numeric_columns <- sapply(data_cleaned, is.numeric)  # Identify numeric columns
numeric_data <- data_cleaned[, numeric_columns]  # Subset numeric data

# 13.2 Create boxplots for each numeric variable and display them two by two
plot_list <- list()  # List to store plots

for (column in colnames(numeric_data)) {
  plot <- ggplot(data_cleaned, aes_string(x = "factor(1)", y = column)) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", column), y = column) +
    theme_minimal() +
    theme(axis.title.x = element_blank())
  
  plot_list <- append(plot_list, list(plot))
}

Initial boxplot visualizations were used to identify potential outliers and anomalies in the data (as mentioned in point 13). Additional visualizations and in-depth analyses will be conducted as part of the Exploratory Data Analysis (EDA) chapter.

Code
# Display the boxplots two by two
for (i in seq(1, length(plot_list), by = 2)) {
  if (i + 1 <= length(plot_list)) {
    print(plot_list[[i]] + plot_list[[i + 1]])  # Print two plots side by side
  } else {
    print(plot_list[[i]])  # If odd number of plots, print the last one by itself
  }
}

Code
# 13.3 Check the data for anomalies and replace the incorrect value of 8170 with 1870
data_cleaned$built_year[data_cleaned$built_year == 8170] <- 1870

# 14. Removing the has_garden column from the data_cleaned dataset
data_cleaned <- data_cleaned %>%
  select(-has_garden)
Code
summary(data)
      ...1             id           title             subtitle        
 Min.   :    0   Min.   :    1   Length:21742       Length:21742      
 1st Qu.: 5435   1st Qu.: 5436   Class :character   Class :character  
 Median :10870   Median :10872   Mode  :character   Mode  :character  
 Mean   :10870   Mean   :10872                                        
 3rd Qu.:16306   3rd Qu.:16307                                        
 Max.   :21741   Max.   :21742                                        
                                                                      
  sq_mt_built     sq_mt_useful      n_rooms        n_bathrooms    
 Min.   : 13.0   Min.   :  1.0   Min.   : 0.000   Min.   : 1.000  
 1st Qu.: 70.0   1st Qu.: 59.0   1st Qu.: 2.000   1st Qu.: 1.000  
 Median :100.0   Median : 79.0   Median : 3.000   Median : 2.000  
 Mean   :146.9   Mean   :103.5   Mean   : 3.006   Mean   : 2.092  
 3rd Qu.:162.0   3rd Qu.:113.0   3rd Qu.: 4.000   3rd Qu.: 2.000  
 Max.   :999.0   Max.   :998.0   Max.   :24.000   Max.   :16.000  
 NA's   :126     NA's   :13514                    NA's   :16      
    n_floors     sq_mt_allotment latitude       longitude     
 Min.   :1.000   Min.   :  1.0   Mode:logical   Mode:logical  
 1st Qu.:2.000   1st Qu.:  2.0   NA's:21742     NA's:21742    
 Median :3.000   Median :232.0                                
 Mean   :3.129   Mean   :241.7                                
 3rd Qu.:4.000   3rd Qu.:354.0                                
 Max.   :7.000   Max.   :997.0                                
 NA's   :20305   NA's   :20310                                
 raw_address        is_exact_address_hidden street_name       
 Length:21742       Mode :logical           Length:21742      
 Class :character   FALSE:6740              Class :character  
 Mode  :character   TRUE :15002             Mode  :character  
                                                              
                                                              
                                                              
                                                              
 street_number       portal           floor           is_floor_under 
 Length:21742       Mode:logical   Length:21742       Mode :logical  
 Class :character   NA's:21742     Class :character   FALSE:18033    
 Mode  :character                  Mode  :character   TRUE :2539     
                                                      NA's :1170     
                                                                     
                                                                     
                                                                     
   door        
 Mode:logical  
 NA's:21742    
               
               
               
               
               
                                                                    neighborhood_id 
 Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro  :  485  
 Neighborhood 59: Conde Orgaz-Piovera (4275.46 €/m2) - District 9: Hortaleza:  471  
 Neighborhood 28: El Viso (6255.45 €/m2) - District 5: Chamartín            :  467  
 Neighborhood 32: Almagro (6564.27 €/m2) - District 6: Chamberí             :  423  
 Neighborhood 72: Aravaca (3600.4 €/m2) - District 11: Moncloa              :  419  
 Neighborhood 22: Lavapiés-Embajadores (4448.3 €/m2) - District 4: Centro   :  385  
 (Other)                                                                    :19092  
  operation           rent_price        rent_price_by_area is_rent_price_known
 Length:21742       Min.   :-34590276   Mode:logical       Mode :logical      
 Class :character   1st Qu.:      725   NA's:21742         FALSE:21742        
 Mode  :character   Median :     1116                                         
                    Mean   :   -59170                                         
                    3rd Qu.:     1687                                         
                    Max.   :     2517                                         
                                                                              
   buy_price       buy_price_by_area is_buy_price_known
 Min.   :  36000   Min.   :  447     Mode:logical      
 1st Qu.: 198000   1st Qu.: 2551     TRUE:21742        
 Median : 375000   Median : 3720                       
 Mean   : 653736   Mean   : 4021                       
 3rd Qu.: 763600   3rd Qu.: 5000                       
 Max.   :8800000   Max.   :18889                       
                                                       
                    house_type_id   is_renewal_needed is_new_development
 HouseType 1: Pisos        :17705   Mode :logical     Mode :logical     
 HouseType 2: Casa o chalet: 1938   FALSE:17747       FALSE:19281       
 HouseType 4: Dúplex       :  676   TRUE :3995        TRUE :1469        
 HouseType 5: Áticos       : 1032                     NA's :992         
 NA's                      :  391                                       
                                                                        
                                                                        
   built_year    has_central_heating has_individual_heating are_pets_allowed
 Min.   :1723    Mode :logical       Mode :logical          Mode:logical    
 1st Qu.:1957    FALSE:9516          FALSE:4092             NA's:21742      
 Median :1970    TRUE :4092          TRUE :9516                             
 Mean   :1970    NA's :8134          NA's :8134                             
 3rd Qu.:1994                                                               
 Max.   :8170                                                               
 NA's   :11742                                                              
  has_ac        has_fitted_wardrobes  has_lift       is_exterior    
 Mode:logical   Mode:logical         Mode :logical   Mode :logical  
 TRUE:11211     TRUE:13399           FALSE:4461      FALSE:1777     
 NA's:10531     NA's:8343            TRUE :14895     TRUE :16922    
                                     NA's :2386      NA's :3043     
                                                                    
                                                                    
                                                                    
 has_garden     has_pool       has_terrace    has_balcony    has_storage_room
 Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical    
 TRUE:1556      TRUE:5171      TRUE:9548      TRUE:3321      TRUE:7698       
 NA's:20186     NA's:16571     NA's:12194     NA's:18421     NA's:14044      
                                                                             
                                                                             
                                                                             
                                                                             
 is_furnished   is_kitchen_equipped is_accessible  has_green_zones
 Mode:logical   Mode:logical        Mode:logical   Mode:logical   
 NA's:21742     NA's:21742          TRUE:4074      TRUE:4057      
                                    NA's:17668     NA's:17685     
                                                                  
                                                                  
                                                                  
                                                                  
   energy_certificate has_parking     has_private_parking has_public_parking
 en trámite :10937    Mode :logical   Mode:logical        Mode:logical      
 no indicado: 3669    FALSE:14023     NA's:21742          NA's:21742        
 E          : 2701    TRUE :7719                                            
 D          : 1121                                                          
 G          :  894                                                          
 F          :  674                                                          
 (Other)    : 1746                                                          
 is_parking_included_in_price parking_price    is_orientation_north
 Mode :logical                Min.   :     0   Mode :logical       
 FALSE:604                    1st Qu.:     0   FALSE:8567          
 TRUE :7115                   Median :     0   TRUE :2791          
 NA's :14023                  Mean   :  2658   NA's :10384         
                              3rd Qu.:     0                       
                              Max.   :600000                       
                              NA's   :14023                        
 is_orientation_west is_orientation_south is_orientation_east
 Mode :logical       Mode :logical        Mode :logical      
 FALSE:7287          FALSE:5535           FALSE:6302         
 TRUE :4071          TRUE :5823           TRUE :5056         
 NA's :10384         NA's :10384          NA's :10384        
                                                             
                                                             
                                                             
Code
summary(data_cleaned)
       id         sq_mt_built       n_rooms        n_bathrooms    
 Min.   :    3   Min.   : 16.0   Min.   : 0.000   Min.   : 1.000  
 1st Qu.: 4786   1st Qu.: 68.0   1st Qu.: 2.000   1st Qu.: 1.000  
 Median :10324   Median : 90.0   Median : 3.000   Median : 2.000  
 Mean   :10644   Mean   :105.6   Mean   : 2.741   Mean   : 1.675  
 3rd Qu.:16448   3rd Qu.:129.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
 Max.   :21742   Max.   :620.0   Max.   :14.000   Max.   :14.000  
 is_exact_address_hidden     floor        neighborhood_id      rent_price  
 Mode :logical           Min.   :-1.000   Length:7854        Min.   :  35  
 FALSE:1828              1st Qu.: 1.000   Class :character   1st Qu.: 850  
 TRUE :6026              Median : 2.000   Mode  :character   Median :1223  
                         Mean   : 2.602                      Mean   :1335  
                         3rd Qu.: 4.000                      3rd Qu.:1758  
                         Max.   : 9.000                      Max.   :2517  
   buy_price       house_type_id      is_renewal_needed   built_year  
 Min.   :  36000   Length:7854        Mode :logical     Min.   :1800  
 1st Qu.: 180000   Class :character   FALSE:6203        1st Qu.:1956  
 Median : 315000   Mode  :character   TRUE :1651        Median :1970  
 Mean   : 404489                                        Mean   :1968  
 3rd Qu.: 550000                                        3rd Qu.:1991  
 Max.   :1378900                                        Max.   :2022  
   has_ac         has_lift        has_pool       has_terrace    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:3546      FALSE:1928      FALSE:6388      FALSE:4583     
 TRUE :4308      TRUE :5926      TRUE :1466      TRUE :3271     
                                                                
                                                                
                                                                
 has_balcony     has_storage_room is_accessible   energy_certificate
 Mode :logical   Mode :logical    Mode :logical   Length:7854       
 FALSE:6565      FALSE:5386       FALSE:5679      Class :character  
 TRUE :1289      TRUE :2468       TRUE :2175      Mode  :character  
                                                                    
                                                                    
                                                                    
 has_parking    
 Mode :logical  
 FALSE:5519     
 TRUE :2335     
                
                
                

2.4 Spotting Mistakes and Missing Data

During the cleaning process, the following issues with mistakes and missing data were identified and addressed:

  • Missing Values: Columns with high NA counts were removed or had missing values filled with FALSE where appropriate. For example, missing data in has_ac, has_garden, and has_pool was assumed to indicate the absence of these features.
  • Wrong Value: We corrected a wrong value in built_year (from 8170 to 1870), as this seems to be twisted numbers.
  • Observations with negative values in rent_price were removed, as this did not make sense.

2.5 Listing Anomalies and Outliers

  • Outlier Identification: Used boxplots for visual inspection. If “>1.5xIQR”, then considered outlier
  • Handling Strategy: Retained legitimate outliers (e.g., high-end properties in prime locations) but removed invalid values (e.g., negative rent prices). Outliers were kept because housing properties can vary based e.g on luxuriousness, and other factors that are not included in our variables. Those high-value properties or very large properties may be unique but still real. In the Madrid housing market, this might include luxury penthouses, historic homes, or properties in prime neighborhoods. By keeping the outliers, we preserves data integrity and can also model for high end or unique properties.

3 Exploratory Data Analysis (EDA)

3.1 Initial Visualizations

3.1.1 Histograms of Rental and Buying prices

Rental: The concentration of rental prices between €800 and €1,200 suggests that there is a significant offer for affordable housing, making this the competitive range for renters. The fewer higher-priced rentals beyond €2,000 may indicate luxury or specialized properties, perhaps in prime locations or with additional amenities. This distribution reflects a standard market pattern where most rental offerings cater to budget-conscious tenants, with a small segment targeting high-income renters. There is also a noticeable spike at €2,200, which may represent a group of high-end properties. This could be due to larger properties, luxury amenities, or prime locations in desirable neighborhoods.

Buying: The buying market is accessible for a broad range of buyers, with most properties under €500,000. The long tail extending past €1,000,000 likely includes high-end or exclusive properties. This pattern indicates that while Madrid’s market has an affordable foundation, there’s also room for luxury properties, suggesting a market that accommodates both average-income buyers and high-end investors. Compared to the rent histogram, we see significantly more outliers in the high price segment. This could come from the fact that very expensive properties could be more likely to be sold instead of rented. An example would be that people usually prefer buying houses compared to renting them, as it is seen as an investment.

Code
# ADD UNIFIED COLOUR SCHEME
# Rent Prices: Blue color scheme
rent_color = "blue"
rent_color_light = "lightblue"

# Buy Prices: Green color scheme
buy_color = "green"
buy_color_light = "lightgreen"

# general colour if not specifically related to one of the 2 categories
general_color = "purple"
general_color_light = "orchid"
Code
# 1.1 Histogram of rental prices (Interactive)

plotly::plot_ly(
  data = data_cleaned, x = ~rent_price, type = "histogram",
  marker = list(color = rent_color)
) %>%
  layout(
    title = "Distribution of Rental Prices",
    xaxis = list(title = "Rental Price (€)"),
    yaxis = list(title = "Frequency")
  )
Code
# 1.2 Histogram of buying prices (Interactive)

plotly::plot_ly(
  data = data_cleaned, x = ~buy_price, type = "histogram",
  marker = list(color = buy_color)
) %>%
  layout(
    title = "Distribution of Buying Prices",
    xaxis = list(title = "Buying Price (€)"),
    yaxis = list(title = "Frequency")
  )

3.1.2 Energy Certificate Overview

3.1.2.1 Bar Chart

Most properties are either categorized as “in process” or have a “not indicated” energy certificate, meaning that a large portion of the dataset lacks finalized or available energy efficiency information. Among properties with specified ratings, the lower categories have the highest counts. This pattern could reflect the age of properties in the dataset, as older buildings tend to have lower energy efficiency. As Madrid is a city with many storic building that would match thet assumption.

Code
# 2.1 Bar chart for energy_certificate (interactive)

plotly::plot_ly(
  data = data_cleaned %>% count(energy_certificate),
  x = ~energy_certificate, y = ~n, type = "bar",
  marker = list(color = general_color)
) %>%
  layout(
    title = "Count of Energy Certificates",
    xaxis = list(title = "Energy Certificate"),
    yaxis = list(title = "Count")
  )

3.1.2.2 Boxplots Rental and Buying Prices by Energy Certificate

The boxplot of buying and rental prices by energy certificate shows no clear upward trend in property prices with improved energy ratings. While properties with ratings A and C appear to have slightly higher medians, there is still considerable overlap between most categories. Properties in categories like F and G still have a wide range of prices, showing that even less energy-efficient properties can have high prices, possibly due to factors like location or property size. As there is no strong price premium on energy-efficient properties, it could imply that buyers in this market are not placing a high value on energy ratings, or it may indicate that other factors, like location and size, are more influential in determining property prices.

Code
# 2.2 Boxplot for rental prices by energy certificate (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~energy_certificate, y = ~rent_price, type = "box",
  marker = list(color = rent_color)
) %>%
  layout(
    title = "Rental Prices by Energy Certificate",
    xaxis = list(title = "Energy Certificate"),
    yaxis = list(title = "Rental Price (€)")
  )
Code
# 2.3 Boxplot for buying prices by energy certificate (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~energy_certificate,
  y = ~buy_price,
  type = "box",
  boxpoints = "outliers",  # Show outliers
  line = list(color = buy_color),  # Set green color for box lines and whiskers
  marker = list(color = buy_color),  # Ensure outlier points are green
  fillcolor = buy_color_light  # Fill boxes with light green
) %>%
  layout(
    title = "Buying Prices by Energy Certificate",
    xaxis = list(title = "Energy Certificate"),
    yaxis = list(title = "Buying Price (€)")
  )

3.1.3 House Type

3.1.3.1 Bar Chart

Flats make up the vast majority of properties in the dataset, with very few duplexes and penthouses. This distribution suggests that flats are the predominant housing type available in the market. As Madrid is an urban area, apartments are a common choice due to high population density and limited space.

Code
# 2.4 Bar chart for house_type_id (interactive)

plotly::plot_ly(
  data = data_cleaned %>% count(house_type_id),
  x = ~house_type_id, y = ~n, type = "bar",
  marker = list(color = general_color)
) %>%
  layout(
    title = "Count of House Types",
    xaxis = list(title = "House Type"),
    yaxis = list(title = "Count")
  )

3.1.3.2 Boxplots rental and buy prices by house type

Based on the median, duplexes and penthouses have higher prices, which highlights their role as premium options in the housing market. These types of properties often offer larger spaces, distinctive layouts, superior views, and greater privacy, which are all features that appeal to both buyers and renters prepared to invest more. Flats, on the other hand, cater to a wider audience, covering a range of price points from affordable to mid-range and even some high-end units. This range of pricing for flats reflects a flexible market. Flats act as the base for various buyers and renters, while duplexes and penthouses meet the specific demands of those seeking more exclusive, upscale living spaces.

Code
# 2.5 Boxplot for rental prices by house type (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~house_type_id, y = ~rent_price, type = "box",
  marker = list(color = rent_color)
) %>%
  layout(
    title = "Rental Prices by House Type",
    xaxis = list(title = "House Type"),
    yaxis = list(title = "Rental Price (€)")
  )
Code
# 2.6 Boxplot for buying prices by house type (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~house_type_id, y = ~buy_price, type = "box",
  fillcolor = buy_color_light,  # Fill the boxes with light green
  line = list(color = buy_color),  # Set the box outlines and whiskers to green
  marker = list(color = buy_color)  # Ensure outliers are green
) %>%
  layout(
    title = "Buying Prices by House Type",
    xaxis = list(title = "House Type"),
    yaxis = list(title = "Buying Price (€)")
  )

3.1.4 Size on Price

The positive correlation between property size and buy and rental price indicates that the size of the property is a key determinant of both rental and purchase prices. Larger properties tend to be more expensive, both to rent and buy, aligning with market expectations. However, for properties above a certain size (around 200 square meters), price variability increases, reflecting the influence of additional factors like neighborhood, and other specific property features. This variability suggests a premium segment in the market for larger, potentially luxury properties, where size alone doesn’t set the price but combines with other value-adding factors. But also properties that probably are in need of renovation can be a cause for the lower priced ones. What is also evident in the scatter plot square meters on rent price is that there is a very distinct upper border for the prices. This could indicate some sort of capped rental prices in the period where the data was pulled. Another potential explanation could be that there is no demand for flats in that price segment, so owners tend to sell instead of rent the properties. Considering the max price border is that distinct, the first explanation appears more likely.

Code
# 3.1 Scatter plot for sq_mt_built vs. rent_price (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~sq_mt_built, y = ~rent_price,
  type = "scatter", mode = "markers",
  marker = list(color = rent_color, opacity = 0.5)  # Add transparency
) %>%
  layout(
    title = "Size vs Rental Prices",
    xaxis = list(title = "Square Meters Built"),
    yaxis = list(title = "Rental Price (€)")
  )
Code
# 3.2 Scatter plot for sq_mt_built vs. buy_price (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~sq_mt_built, y = ~buy_price,
  type = "scatter", mode = "markers",
  marker = list(color = buy_color, opacity = 0.5)  # Add transparency
) %>%
  layout(
    title = "Size vs Buying Prices",
    xaxis = list(title = "Square Meters Built"),
    yaxis = list(title = "Buying Price (€)")
  )

3.1.5 Age on Price

The relationship between a property’s age and its price indicates that both buyers and renters value newer constructions. This could reflect demand for properties with modern amenities, better energy efficiency, and improved construction standards. However, the higher prices for very old buildings (from the 1800s) suggest that there is also a niche market for historic properties, likely due to their architectural charm or premium locations. For late-mid-20th-century properties, the lower prices may indicate a perception of these buildings as lacking in either historic value or modern features, making them less appealing. Those buildings are often classified as post second world war builings, where functionality and fast rebuilt was prioritized over aesthetics. These observations could be helpful for developers or investors targeting renovations or upgrades to increase appeal in older properties, especially mid-century ones that might benefit from modernization.

Code
# 4. Relationship Between Property Age and Prices
# 4.1 Average rent and buy prices by built year
price_by_built_year <- data_cleaned %>%
  group_by(built_year) %>%
  summarize(
    avg_rent_price = mean(rent_price, na.rm = TRUE),
    avg_buy_price = mean(buy_price, na.rm = TRUE)
  )
Code
# 4.2 Scatter plot for average rental prices by built year (interactive)

plotly::plot_ly(
  data = price_by_built_year,
  x = ~built_year,
  y = ~avg_rent_price,
  type = "scatter",
  mode = "markers",
  marker = list(color = rent_color, opacity = 0.6, size = 5)  # Reduced point size
) %>%
  add_trace(
    x = ~price_by_built_year$built_year,
    y = ~predict(loess(avg_rent_price ~ built_year, data = price_by_built_year)),
    mode = "lines",
    line = list(color = rent_color, width = 2)  # Smooth loess line with specified color and width
  ) %>%
  layout(
    title = "Average Rental Prices by Year Built",
    xaxis = list(title = "Year Built"),
    yaxis = list(title = "Average Rent Price (€)"),
    showlegend = FALSE  # Hide legend for simplicity
  )
Code
# 4.3 Scatter plot for average buying prices by built year (interactive)

plotly::plot_ly(
  data = price_by_built_year,
  x = ~built_year,
  y = ~avg_buy_price,
  type = "scatter",
  mode = "markers",
  marker = list(color = buy_color, opacity = 0.6, size = 5)  # Smaller points with transparency
) %>%
  add_trace(
    x = ~price_by_built_year$built_year,
    y = ~predict(loess(avg_buy_price ~ built_year, data = price_by_built_year)),
    mode = "lines",
    line = list(color = buy_color, width = 2)  # Loess smoothing line
  ) %>%
  layout(
    title = "Average Buying Prices by Year Built",
    xaxis = list(title = "Year Built"),
    yaxis = list(title = "Average Buy Price (€)"),
    showlegend = FALSE  # Remove legend for simplicity
  )

3.1.6 Number of rooms and bathrooms on Price

The analysis indicates that the number of rooms and bathrooms significantly impacts both rental and buying prices. Properties with more rooms and bathrooms typically fall into the higher-end segment, appealing to families or buyers with greater purchasing power who value spaciousness and comfort. In the buying market, additional rooms and bathrooms noticeably drive up property values, as buyers appear willing to pay extra for these features. In contrast, the rental market shows a clear price ceiling, especially for larger properties, which suggests there’s a limit to what renters can afford or are willing to pay, regardless of additional space or amenities. This difference highlights that while both size and features add value, rental prices are generally more limited by affordability constraints than buying prices.

Code
# 5.1 Boxplot for Rent Price by Number of Rooms (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~as.factor(n_rooms),
  y = ~rent_price,
  type = "box",
  fillcolor = rent_color_light,
  line = list(color = rent_color),
  marker = list(color = rent_color)
) %>%
  layout(
    title = "Rent Price by Number of Rooms",
    xaxis = list(title = "Number of Rooms"),
    yaxis = list(title = "Rent Price (€)")
  )
Code
# 5.2 Boxplot for Rent Price by Number of Bathrooms (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~as.factor(n_bathrooms),
  y = ~rent_price,
  type = "box",
  fillcolor = rent_color_light,
  line = list(color = rent_color),
  marker = list(color = rent_color)
) %>%
  layout(
    title = "Rent Price by Number of Bathrooms",
    xaxis = list(title = "Number of Bathrooms"),
    yaxis = list(title = "Rent Price (€)")
  )
Code
# 5.3 Boxplot for Buy Price by Number of Rooms (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~as.factor(n_rooms),
  y = ~buy_price,
  type = "box",
  fillcolor = buy_color_light,
  line = list(color = buy_color),
  marker = list(color = buy_color)
) %>%
  layout(
    title = "Buy Price by Number of Rooms",
    xaxis = list(title = "Number of Rooms"),
    yaxis = list(title = "Buy Price (€)")
  )
Code
# 5.4 Boxplot for Buy Price by Number of Bathrooms (interactive)

plotly::plot_ly(
  data = data_cleaned,
  x = ~as.factor(n_bathrooms),
  y = ~buy_price,
  type = "box",
  fillcolor = buy_color_light,
  line = list(color = buy_color),
  marker = list(color = buy_color)
) %>%
  layout(
    title = "Buy Price by Number of Bathrooms",
    xaxis = list(title = "Number of Bathrooms"),
    yaxis = list(title = "Buy Price (€)")
  )

3.1.7 Correlation Analysis

The heatmap illustrates the correlations between various numerical variables in the dataset. The strongest positive correlations are observed between square meters built and number of rooms, as well as between square meters built and buy price, suggesting that larger properties generally have more rooms and higher prices. Similarly, there is a moderate positive correlation between rent price and buy price, which indicates that properties with higher buying prices tend to also have higher rental prices. On the other hand, built year shows weak correlations with other variables, suggesting that the age of the property has a limited direct impact on its price or other features. However, this relationship may be influenced by factors not captured in the dataset, such as the property’s condition, location, or the demand for modern amenities versus historic value.

Code
# 6. Correlation Matrix
cor_matrix <- cor(data_cleaned[sapply(data_cleaned, is.numeric)], use = "complete.obs")
cor_melted <- melt(cor_matrix)

# Heatmap of the correlation matrix (interactive)

plotly::plot_ly(
  data = cor_melted,
  x = ~Var1,
  y = ~Var2,
  z = ~value,
  type = "heatmap",
  colorscale = list(
    list(0, "blue"),   # -1 correlation (blue)
    list(0.5, "white"), # 0 correlation (white)
    list(1, "red")     # +1 correlation (red)
  ),
  zmin = -1,  # Minimum value for the color scale
  zmax = 1,   # Maximum value for the color scale
  colorbar = list(title = "Correlation")
) %>%
  layout(
    title = "Correlation Matrix Heatmap",
    xaxis = list(title = "", tickangle = 45),  # Rotate x-axis labels for better readability
    yaxis = list(title = "")
  )

3.1.8 Rental and Buying Prices by Neighborhood

The highest rent prices per square meter are concentrated in central Madrid, with darker shades indicating premium areas. This central cluster is likely due to the demand for proximity to amenities, Schools, business centers, and cultural attractions. Beyond the central area, northern neighborhoods also show relatively high rent prices, suggesting they are seen as desirable places to live, possibly due to quality infrastructure, green spaces, and high-end residential areas. In contrast, the southern neighborhoods generally show lower rent prices, implying that they may be more affordable or less in demand compared to the north and center. Similar to rental prices, buying prices per square meter are highest in central and northern neighborhoods, with a clear gradient from high prices in the north to lower prices in the south. The northern part of Madrid likely attracts higher buying prices due to factors such as newer developments, higher-end residential zones, and perhaps perceived prestige.

Code
# 7. Relationship between Neighborhood and Price
# 7.1 Create a new dataset and calculate average rent and buy price per square meter by neighborhood
neighborhood_prices <- data_cleaned %>%
  mutate(
    rent_price_per_sqm = rent_price / sq_mt_built,
    buy_price_per_sqm = buy_price / sq_mt_built
  ) %>%
  group_by(neighborhood_id) %>%
  summarise(
    avg_rent_price_per_sqm = mean(rent_price_per_sqm, na.rm = TRUE),
    avg_buy_price_per_sqm = mean(buy_price_per_sqm, na.rm = TRUE)
  )

# 7.2 Download Shapefile of Madrid
# Adjust the path to include the Barrios folder in your Downloads directory
madrid_shapefile <- st_read("data/raw/Distritos/Distritos.shp", quiet = TRUE)


# 7.3 Update names in neighborhood_prices to match shapefile if needed
neighborhood_prices <- neighborhood_prices %>%
  mutate(neighborhood_id = case_when(
    neighborhood_id == "Moncloa" ~ "Moncloa - Aravaca",
    neighborhood_id == "Fuencarral" ~ "Fuencarral - El Pardo",
    TRUE ~ neighborhood_id
  ))

# 7.4 Merge neighborhood_prices with madrid_shapefile based on neighborhood name
merged_neighborhood_prices <- madrid_shapefile %>%
  left_join(neighborhood_prices, by = c("NOMBRE" = "neighborhood_id"))

# Calculate area of each neighborhood polygon for scaling text size and add line breaks after the first blank space in district names
merged_neighborhood_prices <- merged_neighborhood_prices %>%
  mutate(area = as.numeric(st_area(geometry)), # Ensure 'area' column for font scaling
         NOMBRE_wrapped = sub(" ", "\n", NOMBRE)) # Insert line break after the first space
Code
# 7.5 Heatmap for Average Rent Price per Square Meter by Neighborhood with Black Border and District Names
ggplot(merged_neighborhood_prices) +
  geom_sf(aes(fill = avg_rent_price_per_sqm), color = "black", size = 0.1) +
  geom_text(data = merged_neighborhood_prices,
            aes(label = NOMBRE_wrapped, geometry = geometry,
                size = pmin(area / max(area) * 3 + 1.5, 3)), # Scale font size with area, max 3
            stat = "sf_coordinates", # Center label on polygon
            color = "black") +
  scale_fill_gradient(low = "lightyellow", high = "darkred", name = "Avg Rent Price\nper Sq Meter") +
  scale_size_continuous(range = c(1.5, 3), guide = "none") + # Control size limits
  labs(title = "Average Rent Price per Square Meter by Neighborhood in Madrid") +
  theme_minimal()

Code
# 7.6 Heatmap for Average Buy Price per Square Meter by Neighborhood with Black Border and District Names
ggplot(merged_neighborhood_prices) +
  geom_sf(aes(fill = avg_buy_price_per_sqm), color = "black", size = 0.1) +
  geom_text(data = merged_neighborhood_prices,
            aes(label = NOMBRE_wrapped, geometry = geometry,
                size = pmin(area / max(area) * 3 + 1.5, 3)), # Scale font size with area, max 3
            stat = "sf_coordinates", # Center label on polygon
            color = "black") +
  scale_fill_gradient(low = "lightyellow", high = "darkred", name = "Avg Buy Price\nper Sq Meter") +
  scale_size_continuous(range = c(1.5, 3), guide = "none") + # Control size limits
  labs(title = "Average Buy Price per Square Meter by Neighborhood in Madrid") +
  theme_minimal()

3.1.9 Distribution of Property Types Across Neighborhoods

The graph highlights the prevalence of flats across all neighborhoods in Madrid, reflecting their dominant role in meeting the city’s housing needs. Duplexes, while less common, are notable in neighborhoods such as Hortaleza, Moncloa , and Tetuán. This pattern could be linked to varying development trends, with duplexes potentially offering a compromise between space and urban accessibility. Their presence in both central and peripheral areas suggests they cater to diverse buyer preferences, from families to professionals seeking larger living spaces within the city. Penthouses, by contrast, are a niche option and show slightly higher proportions in Barajas and Moncloa. These areas, often less dense and quieter than central neighborhoods, may attract buyers interested in more exclusive housing options with added privacy and better views. However, the relatively low overall proportion of penthouses reflects their luxury positioning and limited availability in the market.

Code
# 8. Distribution of Property Types Across Neighborhoods (interactive)

plotly::plot_ly(
  data = data_cleaned %>% count(neighborhood_id, house_type_id) %>%
    group_by(neighborhood_id) %>%
    mutate(proportion = n / sum(n)),
  x = ~neighborhood_id,
  y = ~proportion,
  color = ~house_type_id,
  type = "bar"
) %>%
  layout(
    barmode = "stack",
    title = "Proportion of Property Types Across Neighborhoods",
    xaxis = list(title = "Neighborhood", tickangle = -45),
    yaxis = list(title = "Proportion"),
    legend = list(title = list(text = "House Type"))
  )

3.1.10 Energy Efficiency vs. Age of Property

This graph highlights how energy efficiency ratings vary by property age, reflecting changes in construction practices and regulations over time. Older properties, especially those built before 1950, are predominantly rated F or G, showing low energy efficiency due to outdated building techniques. In contrast, buildings constructed after 2000 exhibit higher energy efficiency, with more frequent ratings of A and B, likely driven by modern regulations and advancements in sustainable construction. The significant presence of “not indicated” and “in process” ratings for newer properties suggests gaps in certification or ongoing assessments.

Code
# 9. Relation Between Energy Efficiency and Age of Property
ggplot(data_cleaned, aes(x = built_year, y = energy_certificate, fill = energy_certificate)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Energy Efficiency Rating by Property Age",
       x = "Year Built",
       y = "Energy Certificate",
       fill = "Energy Certificate") +
  theme_minimal()

3.1.11 Amenities vs. Prices

The graphs highlight the significant role of amenities in influencing property prices, with lifts showing the largest impact. This reflects their necessity in urban living, especially in multi-story buildings, where accessibility is crucial for families, the elderly, or those with mobility challenges. Properties without lifts, particularly in buildings constructed before lifts became standard, are less appealing, reducing their market value. Parking also drives prices, as secure parking is essential in areas with limited street options, catering to practical needs. Pools and storage rooms add value by addressing lifestyle preferences, appealing to those seeking comfort or extra utility. Terraces, while desirable for outdoor access, show a smaller impact, likely because their value depends on the property’s context, being more appealing in dense urban areas where outdoor space is limited. These trends emphasize how practical amenities like lifts and parking significantly influence demand, while luxury features cater to specific buyer or tenant preferences.

Code
# 10. Amenities vs. Prices
# Pivot data for all amenities
data_cleaned_long <- data_cleaned %>%
  select(rent_price, buy_price, starts_with("has_")) %>%
  pivot_longer(cols = starts_with("has_"), names_to = "amenity", values_to = "presence")
Code
# Boxplot for rent price
ggplot(data_cleaned_long, aes(x = as.factor(presence), y = rent_price, fill = as.factor(presence))) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ amenity, scales = "free_y") +
  labs(title = "Impact of Amenities on Rent Price",
       x = "Amenity Presence",
       y = "Rent Price (€)",
       fill = "Presence") +
  theme_minimal()

Code
# Boxplot for buy price
ggplot(data_cleaned_long, aes(x = as.factor(presence), y = buy_price, fill = as.factor(presence))) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ amenity, scales = "free_y") +
  labs(title = "Impact of Amenities on Buy Price",
       x = "Amenity Presence",
       y = "Buy Price (€)",
       fill = "Presence") +
  theme_minimal()

3.1.12 Amenity Across Property Types

The graph shows how amenities are distributed across property types, reflecting their different market positioning. Penthouses and duplexes typically offer more amenities, such as terraces and parking, appealing to buyers seeking space and luxury. Flats prioritize essential features like lifts, which are crucial for accessibility in urban areas, but have fewer lifestyle-focused amenities like pools or storage rooms, likely due to space and affordability constraints. Penthouses, associated with exclusivity, frequently include terraces and parking, catering to high-end preferences.

Code
# 11. Amenity Presence Across Property Types
# Count of amenities by property type (interactive)
amenities_property_type <- data_cleaned %>%
  pivot_longer(cols = starts_with("has_"), names_to = "amenity", values_to = "presence") %>%
  group_by(house_type_id, amenity) %>%
  summarize(presence_rate = mean(presence, na.rm = TRUE), .groups = "drop")


plotly::plot_ly(
  data = amenities_property_type,
  x = ~house_type_id,
  y = ~presence_rate,
  color = ~amenity,
  type = "bar",
  text = ~paste("Amenity:", amenity, "<br>Property Type:", house_type_id, "<br>Presence Rate:", round(presence_rate, 2)),
  hoverinfo = "text"
) %>%
  layout(
    title = "Amenity Distribution Across Property Types",
    xaxis = list(title = "Property Type", tickangle = -45),
    yaxis = list(title = "Amenity Presence Rate"),
    legend = list(title = list(text = "Amenity"))
  )

3.2 Summary Statistics

The summary statistics of the cleaned dataset reveal a diverse range of properties in Madrid. The mean property size is 106.4 m², but the median is 68.6 m², indicating a predominance of smaller properties. Most properties have around 2 rooms, and rental prices have a median of €1,223, while buying prices are higher, with a median of €315,000. The highest buying price exceeds €1.3 million, highlighting luxury properties in the market.

This diversity has already emerged when visualizing the data: Flats are the most common property type, followed by penthouses and duplexes, with the Centro neighborhood having the highest number of listings. Most properties have an energy certificate marked as “in process” or “not indicated,” and a notable portion in the “E” energy efficiency category. A significant share of properties offers amenities like air conditioning, terraces, and balconies, though many lack parking or accessibility.

With a mean built year of 1968, many properties are older and may require renewal. This reflects the presence of older buildings in the market. Overall, the dataset includes a wide variety of property types, sizes, and conditions, reflecting the diverse housing market in Madrid.

Code
# 7. Summary statistics of the cleaned data
summary(data_cleaned)
       id         sq_mt_built       n_rooms        n_bathrooms    
 Min.   :    3   Min.   : 16.0   Min.   : 0.000   Min.   : 1.000  
 1st Qu.: 4786   1st Qu.: 68.0   1st Qu.: 2.000   1st Qu.: 1.000  
 Median :10324   Median : 90.0   Median : 3.000   Median : 2.000  
 Mean   :10644   Mean   :105.6   Mean   : 2.741   Mean   : 1.675  
 3rd Qu.:16448   3rd Qu.:129.0   3rd Qu.: 3.000   3rd Qu.: 2.000  
 Max.   :21742   Max.   :620.0   Max.   :14.000   Max.   :14.000  
 is_exact_address_hidden     floor        neighborhood_id      rent_price  
 Mode :logical           Min.   :-1.000   Length:7854        Min.   :  35  
 FALSE:1828              1st Qu.: 1.000   Class :character   1st Qu.: 850  
 TRUE :6026              Median : 2.000   Mode  :character   Median :1223  
                         Mean   : 2.602                      Mean   :1335  
                         3rd Qu.: 4.000                      3rd Qu.:1758  
                         Max.   : 9.000                      Max.   :2517  
   buy_price       house_type_id      is_renewal_needed   built_year  
 Min.   :  36000   Length:7854        Mode :logical     Min.   :1800  
 1st Qu.: 180000   Class :character   FALSE:6203        1st Qu.:1956  
 Median : 315000   Mode  :character   TRUE :1651        Median :1970  
 Mean   : 404489                                        Mean   :1968  
 3rd Qu.: 550000                                        3rd Qu.:1991  
 Max.   :1378900                                        Max.   :2022  
   has_ac         has_lift        has_pool       has_terrace    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:3546      FALSE:1928      FALSE:6388      FALSE:4583     
 TRUE :4308      TRUE :5926      TRUE :1466      TRUE :3271     
                                                                
                                                                
                                                                
 has_balcony     has_storage_room is_accessible   energy_certificate
 Mode :logical   Mode :logical    Mode :logical   Length:7854       
 FALSE:6565      FALSE:5386       FALSE:5679      Class :character  
 TRUE :1289      TRUE :2468       TRUE :2175      Mode  :character  
                                                                    
                                                                    
                                                                    
 has_parking    
 Mode :logical  
 FALSE:5519     
 TRUE :2335     
                
                
                

3.3 Key Findings or Patterns

Property Size: Both rental and buying prices generally increase with property size, as expected, with larger properties catering to a premium market. For properties exceeding 200 square meters, price variability increases, likely influenced by factors like location or specific property features.

Rooms and Bathrooms: The number of rooms and bathrooms is a strong determinant of both rental and buying prices, particularly in the buying market, where these features appeal to families and high-end buyers. In the rental market, prices for larger properties appear to hit a ceiling (€2,500), suggesting affordability constraints.

Energy Efficiency: Energy certificates show limited correlation with property prices, indicating that energy efficiency may not yet be a major pricing factor in Madrid. Older properties, especially pre-1950, often have lower efficiency ratings (F or G), while newer properties built after 2000 tend to have higher ratings (A or B), reflecting modern construction standards. Many properties still lack finalized energy certifications, particularly newer ones marked as “in process.”

Neighborhood Influence: Central and northern neighborhoods command the highest average buy and rent prices, likely due to proximity to amenities, prestige, and infrastructure quality. Southern neighborhoods are more affordable, indicating less demand or lower economic development in these areas.

Property Types Across Neighborhoods: Flats dominate the housing market across all neighborhoods, offering a practical and affordable option for urban living. Duplexes and penthouses are less common but are concentrated in neighborhoods like Hortaleza and Moncloa, reflecting a localized demand for more spacious or luxury properties.

Amenities and Prices: Practical amenities like lifts and parking have a significant impact on property prices, particularly in dense urban areas where they are highly valued. Luxury features like pools and terraces also increase prices but cater to a smaller segment of buyers or renters with specific preferences. The distinct price premium associated with lifts underscores their necessity in multi-story buildings.

Amenities by Property Type: Flats focus on essential features like lifts, while penthouses and duplexes are more likely to include high-end amenities such as terraces and parking, aligning with their luxury positioning. This distribution highlights a flexible housing market that caters to both practical and premium demands.

Age and Price: Both very old properties (e.g., pre-1900) and newer constructions tend to command higher prices. Older properties likely appeal to buyers for their historic charm and prime locations, while newer ones attract demand for modern features and energy efficiency. Properties from the mid-20th century tend to be less valued, possibly due to their age and lack of modernization, presenting opportunities for redevelopment or upgrades.

This analysis reflects the complexity and diversity of Madrid’s housing market, shaped by a combination of structural features, location, and buyer preferences. These insights provide a solid foundation for further predictive modeling and targeted recommendations.

4 Analysis

To evaluate the predictive performance of different models for property price data, a structured analysis was conducted. Property buy and rent prices in the dataset were log-transformed (buy_price_log and rent_price_log) to address skewness and improve model performance. Three models were implemented: Lasso Regression, Linear Regression, and Random Forest.

4.1 Data Preparation and Splitting

The dataset was prepared by transforming key variables and incorporating additional features, such as clustering results from Gaussian Mixture Models (GMM). GMM was applied to the log-transformed rental prices (rent_price_log) to classify properties into two categories: “Inexpensive” and “Expensive.” These categories were used as an additional predictor for rent price models. The dataset was then split into training and testing sets using a 90-10 split. The training set (90% of the data) was used to train the models, while the testing set (10%) was reserved for evaluation. This approach ensured robust performance metrics and minimized overfitting.

4.2 Model Evaluation

Three models (Lasso Regression, Linear Regression, and Random Forest) were trained to predict buy_price_log. Their performance was evaluated using the Root Mean Squared Error (RMSE) and R-squared (R²) values. The results are as follows:

Model RMSE
Random Forest 0.1783596 0.9318762
Linear Regression 0.2244886 0.8946683
Lasso Regression 0.2244320 N/A

The Random Forest model achieved the lowest RMSE, demonstrating its ability to capture complex, non-linear relationships in the data. Linear Regression and Lasso Regression performed similarly but less effectively than Random Forest.

The models were also evaluated for predicting rent_price_log. The GMM clustering results (“Inexpensive” and “Expensive”) were included as a feature in these models. The performance metrics were:

Model RMSE
Random Forest 0.1300083 0.9160775
Linear Regression 0.1783225 0.7696447
Lasso Regression 0.1782763 N/A

Random Forest significantly outperformed the other models, with the lowest RMSE and the highest R². The inclusion of GMM clustering likely enhanced the model’s ability to distinguish patterns, further improving its accuracy.

Code
# 1. Create two separate datasets for rent and buy predictions

# Dataset for Rent Prediction
data_cleaned_analysis_rent <- data_cleaned %>%
  mutate(
    is_exact_address_hidden = as.factor(is_exact_address_hidden),
    neighborhood_id = as.factor(neighborhood_id),
    house_type_id = as.factor(house_type_id),
    is_renewal_needed = as.factor(is_renewal_needed),
    energy_certificate = as.factor(energy_certificate),
    gmm_cluster = as.factor(gmm_cluster),  # GMM cluster as factor
    price_category = as.factor(price_category)  # Price category as factor
  ) %>%
  select(-buy_price, -rent_price, -id, -buy_price_log)  # Exclude buy_price_log

# Dataset for Buy Prediction
data_cleaned_analysis_buy <- data_cleaned %>%
  mutate(
    is_exact_address_hidden = as.factor(is_exact_address_hidden),
    neighborhood_id = as.factor(neighborhood_id),
    house_type_id = as.factor(house_type_id),
    is_renewal_needed = as.factor(is_renewal_needed),
    energy_certificate = as.factor(energy_certificate),
    gmm_cluster = as.factor(gmm_cluster),  # GMM cluster as factor
    price_category = as.factor(price_category)  # Price category as factor
  ) %>%
  select(-buy_price, -rent_price, -id, -rent_price_log)  # Exclude rent_price_log

# 2. Splitting the data into training and test datasets

set.seed(123)  # Set seed for reproducibility

# Split data for Rent Prediction
training_data_rent <- data_cleaned_analysis_rent %>%
  sample_frac(0.9)
test_data_rent <- anti_join(data_cleaned_analysis_rent, training_data_rent)

# Split data for Buy Prediction
training_data_buy <- data_cleaned_analysis_buy %>%
  sample_frac(0.9)
test_data_buy <- anti_join(data_cleaned_analysis_buy, training_data_buy)

# 3. Lasso Model
# Adjusted for rent and buy separately

evaluate_lasso <- function(training_data, test_data, target_var, include_gmm = FALSE) {
  predictors <- if (include_gmm) "." else colnames(training_data)[!colnames(training_data) %in% c("gmm_cluster", target_var)]

  x_train <- model.matrix(reformulate(predictors, target_var), data = training_data)[, -1]  # Remove intercept
  y_train <- training_data[[target_var]]
  x_test <- model.matrix(reformulate(predictors, target_var), data = test_data)[, -1]
  y_test <- test_data[[target_var]]

  cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
  best_lambda <- cv_lasso$lambda.min

  lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda)

  predictions <- predict(lasso_model, s = best_lambda, newx = x_test)
  rmse <- sqrt(mean((predictions - y_test)^2))

  return(list(rmse = rmse, model = lasso_model))
}

# Rent Prediction - Lasso
lasso_rent_results <- evaluate_lasso(training_data_rent, test_data_rent, "rent_price_log", include_gmm = TRUE)
rmse_rent_lasso <- lasso_rent_results$rmse
lasso_rent_model <- lasso_rent_results$model

# Buy Prediction - Lasso
lasso_buy_results <- evaluate_lasso(training_data_buy, test_data_buy, "buy_price_log", include_gmm = FALSE)
rmse_buy_lasso <- lasso_buy_results$rmse
lasso_buy_model <- lasso_buy_results$model

# 4. Linear Regression Model
# Adjusted for rent and buy separately

evaluate_lm <- function(training_data, test_data, target_var, include_gmm = FALSE) {
  predictors <- if (include_gmm) "." else colnames(training_data)[!colnames(training_data) %in% c("gmm_cluster", target_var)]

  lm_model <- lm(reformulate(predictors, target_var), data = training_data)
  predictions <- predict(lm_model, newdata = test_data)
  rmse <- sqrt(mean((predictions - test_data[[target_var]])^2))
  r_squared <- summary(lm_model)$r.squared

  return(list(rmse = rmse, r_squared = r_squared, model = lm_model))
}

# Rent Prediction - Linear Regression
lm_rent_results <- evaluate_lm(training_data_rent, test_data_rent, "rent_price_log", include_gmm = TRUE)
rmse_lm_rent <- lm_rent_results$rmse
lm_rent_model <- lm_rent_results$model

# Buy Prediction - Linear Regression
lm_buy_results <- evaluate_lm(training_data_buy, test_data_buy, "buy_price_log", include_gmm = FALSE)
rmse_lm_buy <- lm_buy_results$rmse
lm_buy_model <- lm_buy_results$model

# 5. Random Forest Model
# Adjusted for rent and buy separately

evaluate_rf <- function(training_data, test_data, target_var, include_gmm = FALSE) {
  predictors <- if (include_gmm) "." else colnames(training_data)[!colnames(training_data) %in% c("gmm_cluster", target_var)]

  rf_model <- randomForest(reformulate(predictors, target_var), data = training_data, importance = TRUE, ntree = 500)
  predictions <- predict(rf_model, newdata = test_data)
  rmse <- sqrt(mean((predictions - test_data[[target_var]])^2))
  r_squared <- 1 - sum((test_data[[target_var]] - predictions)^2) / sum((test_data[[target_var]] - mean(test_data[[target_var]]))^2)

  return(list(rmse = rmse, r_squared = r_squared, model = rf_model))
}

# Rent Prediction - Random Forest
rf_rent_results <- evaluate_rf(training_data_rent, test_data_rent, "rent_price_log", include_gmm = TRUE)
rmse_rf_rent <- rf_rent_results$rmse
rf_rent_model <- rf_rent_results$model

# Buy Prediction - Random Forest
rf_buy_results <- evaluate_rf(training_data_buy, test_data_buy, "buy_price_log", include_gmm = FALSE)
rmse_rf_buy <- rf_buy_results$rmse
rf_buy_model <- rf_buy_results$model

# 6. Compare Model Performance
model_comparison <- data.frame(
  Model = c("Lasso - Buy Price Log", "Linear Regression - Buy Price Log", "Random Forest - Buy Price Log",
            "Lasso - Rent Price Log", "Linear Regression - Rent Price Log", "Random Forest - Rent Price Log"),
  RMSE = c(rmse_buy_lasso, rmse_lm_buy, rmse_rf_buy,
           rmse_rent_lasso, rmse_lm_rent, rmse_rf_rent),
  R_Squared = c(NA, lm_buy_results$r_squared, rf_buy_results$r_squared,
                NA, lm_rent_results$r_squared, rf_rent_results$r_squared)
)

# Split data for buy and rent price models
model_comparison_buy <- model_comparison %>%
  filter(grepl("Buy Price Log", Model))

model_comparison_rent <- model_comparison %>%
  filter(grepl("Rent Price Log", Model))

# 7. Visualize Model Performance

# Define custom colors
custom_colors <- c("Rent Price Log" = rent_color, "Buy Price Log" = buy_color)

4.3 Insights from Model Comparison

The Random Forest model consistently demonstrated superior performance across both buy and rent price predictions. Its high R² values (above 0.9) indicate excellent explanatory power, while the low RMSE values suggest strong predictive accuracy. Linear Regression and Lasso Regression were less effective, with comparable RMSEs and lower R² values.

The inclusion of GMM clustering for rent price prediction was particularly beneficial, as it helped the models capture variations in price categories. This feature, combined with the Random Forest’s ensemble approach, resulted in the best overall performance.

4.3.1 Visual Representation of Results

The performance of the models is visually summarized in bar plots, comparing RMSE values for both buy_price_log and rent_price_log. The plots clearly show the superiority of Random Forest in both scenarios. Notably, the RMSE for rent price predictions is significantly lower, underscoring the benefit of additional features like GMM clustering.

Code
# 7.1 Visualize Model Performance for Buy Prices with plotly
plot_ly(
  data = model_comparison_buy,
  x = ~RMSE,
  y = ~Model,
  type = 'bar',
  orientation = 'h',
  marker = list(color = custom_colors["Buy Price Log"])
) %>%
  layout(
    title = "Model RMSE Comparison - Buy Prices",
    xaxis = list(title = "RMSE", range = c(0, 0.25)), # Set y-axis limits up to RMSE = 0.25
    yaxis = list(title = "Model"),
    barmode = 'stack'
  )
Code
# 7.2 Visualize Model Performance for Rent Prices with plotly
plot_ly(
  data = model_comparison_rent,
  x = ~RMSE,
  y = ~Model,
  type = 'bar',
  orientation = 'h',
  marker = list(color = custom_colors["Rent Price Log"])
) %>%
  layout(
    title = "Model RMSE Comparison - Rent Prices",
    xaxis = list(title = "RMSE", range = c(0, 0.25)), # Set y-axis limits up to RMSE = 0.25
    yaxis = list(title = "Model"),
    barmode = 'stack'
  )

4.4 Coefficient and Feature Importance Analysis

4.4.1 Lasso Regression

Positive Influence:

  • The neighborhood of Salamanca exhibits the strongest positive impact on buy prices (0.2842), followed by Chamartín (0.2094) and Chamberí (0.2439).
  • Among property features, the presence of a lift (0.2196), a pool (0.1255), and parking (0.1012) significantly elevate buy prices.

Negative Influence:

  • The neighborhood of Villaverde shows the most substantial negative impact on buy prices (-0.7046), followed by Villa de Vallecas (-0.5698) and Usera (-0.5588).
  • The category “Inexpensive” within price clustering (-0.2212) and properties in need of renewal (-0.0569) also negatively influence buy prices.

Positive Influence:

  • Chamberí (0.9074) is the strongest positively correlated neighborhood for rent prices, with significant contributions also from Centro (0.1402) and Salamanca (0.1258).
  • Property features like the presence of a lift (0.1919), a pool (0.0917), and room count (0.0456) positively influence rent prices.

Negative Influence:

  • Villaverde (-0.4689) and Puente de Vallecas (-0.4235) are the neighborhoods with the most negative effect on rent prices.
  • Properties in the neighborhood of Usera (-0.3497) and with hidden exact addresses (-0.0178) also detract slightly from rental value.
Code
# 8.Coefficients
# 8.1 Lasso Coefficients
# Buy Price Log
lasso_buy_coefficients <- coef(lasso_buy_results$model)
print("Lasso Model Coefficients - Buy Price Log:")
[1] "Lasso Model Coefficients - Buy Price Log:"
Code
print(as.matrix(lasso_buy_coefficients))
                                             s0
(Intercept)                       12.5862032416
sq_mt_built                        0.0040939682
n_rooms                            0.0238049827
n_bathrooms                        0.0878155218
is_exact_address_hiddenTRUE       -0.0115598404
floor                              0.0095748146
neighborhood_idBarajas            -0.1641807296
neighborhood_idCarabanchel        -0.4664457687
neighborhood_idCentro              0.2458996949
neighborhood_idChamartín           0.2094375266
neighborhood_idChamberí            0.2438986195
neighborhood_idCiudad Lineal      -0.2303960941
neighborhood_idFuencarral         -0.1636148176
neighborhood_idHortaleza          -0.1522729690
neighborhood_idLatina             -0.4401949455
neighborhood_idMoncloa             0.0000000000
neighborhood_idMoratalaz          -0.3398784237
neighborhood_idPuente de Vallecas -0.6108928915
neighborhood_idRetiro              0.1398024320
neighborhood_idSalamanca           0.2842067026
neighborhood_idTetuán             -0.0596394005
neighborhood_idUsera              -0.5588335336
neighborhood_idVicálvaro          -0.4986302182
neighborhood_idVilla de Vallecas  -0.5698332661
neighborhood_idVillaverde         -0.7046331822
house_type_idFlat                  0.0314179476
house_type_idPenthouse             0.1469975627
is_renewal_neededTRUE             -0.0569423446
built_year                        -0.0002724513
has_acTRUE                         0.0672759950
has_liftTRUE                       0.2196114924
has_poolTRUE                       0.1255602232
has_terraceTRUE                    0.0000000000
has_balconyTRUE                    0.0452039764
has_storage_roomTRUE               0.0318473226
is_accessibleTRUE                 -0.0022445655
energy_certificateB               -0.0026800704
energy_certificateC                0.0254300804
energy_certificateD                0.0081886119
energy_certificateE               -0.0102196553
energy_certificateexempt          -0.0220461077
energy_certificateF               -0.0426821172
energy_certificateG               -0.0516816872
energy_certificatein process      -0.0152598694
energy_certificatenot indicated   -0.0385308323
has_parkingTRUE                    0.1012281151
price_categoryInexpensive         -0.2212452184
Code
# Rent Price Log
lasso_rent_coefficients <- coef(lasso_rent_results$model)
print("Lasso Model Coefficients - Rent Price Log:")
[1] "Lasso Model Coefficients - Rent Price Log:"
Code
print(as.matrix(lasso_rent_coefficients))
                                             s0
(Intercept)                        6.742623e+00
sq_mt_built                        0.000000e+00
n_rooms                            4.562660e-02
n_bathrooms                        3.879170e-02
is_exact_address_hiddenTRUE       -1.782811e-02
floor                              6.769148e-03
neighborhood_idBarajas            -4.206729e-02
neighborhood_idCarabanchel        -3.151378e-01
neighborhood_idCentro              1.401758e-01
neighborhood_idChamartín           1.085822e-01
neighborhood_idChamberí            9.073874e-02
neighborhood_idCiudad Lineal      -1.783906e-01
neighborhood_idFuencarral         -8.228360e-02
neighborhood_idHortaleza          -8.588804e-02
neighborhood_idLatina             -3.089287e-01
neighborhood_idMoncloa            -1.298284e-02
neighborhood_idMoratalaz          -2.095039e-01
neighborhood_idPuente de Vallecas -4.235016e-01
neighborhood_idRetiro              7.864586e-02
neighborhood_idSalamanca           1.257815e-01
neighborhood_idTetuán             -3.754348e-02
neighborhood_idUsera              -3.497031e-01
neighborhood_idVicálvaro          -3.392995e-01
neighborhood_idVilla de Vallecas  -3.925365e-01
neighborhood_idVillaverde         -4.689061e-01
house_type_idFlat                  0.000000e+00
house_type_idPenthouse             7.402674e-02
is_renewal_neededTRUE             -4.275110e-02
built_year                        -1.677512e-05
has_acTRUE                         4.000533e-02
has_liftTRUE                       1.919225e-01
has_poolTRUE                       9.174482e-02
has_terraceTRUE                    1.728147e-03
has_balconyTRUE                    5.134625e-02
has_storage_roomTRUE               2.801520e-02
is_accessibleTRUE                  3.025108e-03
energy_certificateB               -1.037419e-02
energy_certificateC                3.844382e-02
energy_certificateD                2.306720e-02
energy_certificateE                1.040640e-02
energy_certificateexempt          -1.106932e-05
energy_certificateF               -2.330798e-02
energy_certificateG               -2.564633e-02
energy_certificatein process       0.000000e+00
energy_certificatenot indicated   -1.941262e-03
has_parkingTRUE                    7.695625e-02
gmm_cluster2                       3.692432e-01
price_categoryInexpensive          0.000000e+00

4.4.2 Linear Regression

Positive Impact:

  • Similar to Lasso, Salamanca (0.2778) has the strongest positive coefficient, followed by Chamberí (0.2372) and Centro (0.2375).
  • Building features like lift presence (0.2191) and parking availability (0.1020) also drive buy prices upward.

Negative Impact:

  • Villaverde (-0.7157) shows the largest negative effect, with Villa de Vallecas (-0.5816) and Usera (-0.5694) close behind.
  • Properties categorized as “Inexpensive” (-0.2197) and those in need of renewal (-0.0581) also show depreciation in value.

Positive Impact:

  • Chamberí (0.0720), Salamanca (0.1083), and Centro (0.1209) lead in positively influencing rent prices.
  • The presence of a lift (0.1894), a pool (0.0948), and higher room count (0.0469) are other significant positive contributors.

Negative Impact:

  • Villaverde (-0.4966), Puente de Vallecas (-0.4493), and Usera (-0.3762) continue to show a strong negative correlation.
  • Other neighborhoods like Vicálvaro (-0.3698) and Villa de Vallecas (-0.4212) exhibit a slightly less pronounced negative influence.
Code
# 8.2 Linear Regression Coefficients
# Buy Price Log
lm_buy_coefficients <- coef(lm_buy_results$model)
print("Linear Regression Coefficients - Buy Price Log:")
[1] "Linear Regression Coefficients - Buy Price Log:"
Code
print(lm_buy_coefficients)
                      (Intercept)                       sq_mt_built 
                    12.6596808326                      0.0040935982 
                          n_rooms                       n_bathrooms 
                     0.0243773489                      0.0877609028 
      is_exact_address_hiddenTRUE                             floor 
                    -0.0115554253                      0.0097541643 
           neighborhood_idBarajas        neighborhood_idCarabanchel 
                    -0.1821787652                     -0.4764430615 
            neighborhood_idCentro          neighborhood_idChamartín 
                     0.2375383550                      0.2025848604 
          neighborhood_idChamberí      neighborhood_idCiudad Lineal 
                     0.2371603339                     -0.2405367219 
        neighborhood_idFuencarral          neighborhood_idHortaleza 
                    -0.1758849461                     -0.1626710502 
            neighborhood_idLatina            neighborhood_idMoncloa 
                    -0.4506973011                     -0.0091990722 
         neighborhood_idMoratalaz neighborhood_idPuente de Vallecas 
                    -0.3505309002                     -0.6209920828 
            neighborhood_idRetiro          neighborhood_idSalamanca 
                     0.1333283868                      0.2778219870 
            neighborhood_idTetuán              neighborhood_idUsera 
                    -0.0693628743                     -0.5694465778 
         neighborhood_idVicálvaro  neighborhood_idVilla de Vallecas 
                    -0.5110301186                     -0.5815671222 
        neighborhood_idVillaverde                 house_type_idFlat 
                    -0.7156526324                      0.0380388848 
           house_type_idPenthouse             is_renewal_neededTRUE 
                     0.1545963510                     -0.0580602455 
                       built_year                        has_acTRUE 
                    -0.0002887618                      0.0671753203 
                     has_liftTRUE                      has_poolTRUE 
                     0.2190893570                      0.1269000237 
                  has_terraceTRUE                   has_balconyTRUE 
                    -0.0004272675                      0.0453579469 
             has_storage_roomTRUE                 is_accessibleTRUE 
                     0.0326527468                     -0.0046874949 
              energy_certificateB               energy_certificateC 
                    -0.0465642589                     -0.0143908375 
              energy_certificateD               energy_certificateE 
                    -0.0322925327                     -0.0529813016 
         energy_certificateexempt               energy_certificateF 
                    -0.0689341425                     -0.0864314488 
              energy_certificateG      energy_certificatein process 
                    -0.0950314058                     -0.0569512197 
  energy_certificatenot indicated                   has_parkingTRUE 
                    -0.0810871513                      0.1020217131 
        price_categoryInexpensive 
                    -0.2197216523 
Code
# Rent Price Log
lm_rent_coefficients <- coef(lm_rent_results$model)
print("Linear Regression Coefficients - Rent Price Log:")
[1] "Linear Regression Coefficients - Rent Price Log:"
Code
print(lm_rent_coefficients)
                      (Intercept)                       sq_mt_built 
                     6.808913e+00                     -2.848237e-05 
                          n_rooms                       n_bathrooms 
                     4.696734e-02                      3.890268e-02 
      is_exact_address_hiddenTRUE                             floor 
                    -1.888731e-02                      6.933971e-03 
           neighborhood_idBarajas        neighborhood_idCarabanchel 
                    -8.393016e-02                     -3.411508e-01 
            neighborhood_idCentro          neighborhood_idChamartín 
                     1.209248e-01                      8.987051e-02 
          neighborhood_idChamberí      neighborhood_idCiudad Lineal 
                     7.204392e-02                     -2.037796e-01 
        neighborhood_idFuencarral          neighborhood_idHortaleza 
                    -1.107650e-01                     -1.127320e-01 
            neighborhood_idLatina            neighborhood_idMoncloa 
                    -3.350659e-01                     -3.863642e-02 
         neighborhood_idMoratalaz neighborhood_idPuente de Vallecas 
                    -2.373290e-01                     -4.493170e-01 
            neighborhood_idRetiro          neighborhood_idSalamanca 
                     6.052144e-02                      1.082880e-01 
            neighborhood_idTetuán              neighborhood_idUsera 
                    -6.218216e-02                     -3.761880e-01 
         neighborhood_idVicálvaro  neighborhood_idVilla de Vallecas 
                    -3.698942e-01                     -4.212083e-01 
        neighborhood_idVillaverde                 house_type_idFlat 
                    -4.966432e-01                      1.079504e-02 
           house_type_idPenthouse             is_renewal_neededTRUE 
                     8.603526e-02                     -4.541284e-02 
                       built_year                        has_acTRUE 
                    -3.434943e-05                      3.978917e-02 
                     has_liftTRUE                      has_poolTRUE 
                     1.894294e-01                      9.482849e-02 
                  has_terraceTRUE                   has_balconyTRUE 
                     4.386875e-03                      5.222022e-02 
             has_storage_roomTRUE                 is_accessibleTRUE 
                     2.999985e-02                      2.919627e-03 
              energy_certificateB               energy_certificateC 
                    -3.605638e-02                      2.132267e-02 
              energy_certificateD               energy_certificateE 
                     4.945948e-03                     -8.681439e-03 
         energy_certificateexempt               energy_certificateF 
                    -3.181524e-02                     -4.820292e-02 
              energy_certificateG      energy_certificatein process 
                    -4.935833e-02                     -2.069874e-02 
  energy_certificatenot indicated                   has_parkingTRUE 
                    -2.526084e-02                      7.874968e-02 
                     gmm_cluster2         price_categoryInexpensive 
                     3.686160e-01                                NA 

4.4.3 Random Forest Feature Importance

Key Influencers:

  • Neighborhood is the most influential variable, consistent with the Lasso and Linear Regression results.
  • Square meters built, lift presence, and parking are among the other significant determinants of buy price.

Key Influencers:

  • As with buy prices, neighborhood quality and key features like the presence of a lift or a pool are highly influential for rent prices.
  • The GMM clustering feature captures important property segmentation effects, adding a unique perspective not visible in coefficient-based models.
Code
# 8.3 Random Forest Feature Importance
# Buy Price Log
rf_buy_importance <- importance(rf_buy_results$model)
print("Random Forest Feature Importance - Buy Price Log:")
[1] "Random Forest Feature Importance - Buy Price Log:"
Code
print(rf_buy_importance)
                          %IncMSE IncNodePurity
sq_mt_built              60.56771    953.484905
n_rooms                  32.66390     82.761956
n_bathrooms              37.23879    425.262562
is_exact_address_hidden  10.41319      6.286522
floor                    56.62964     41.722111
neighborhood_id         122.93837    757.507995
house_type_id            32.23609      9.980120
is_renewal_needed        25.73364      7.899665
built_year               48.20690     86.690287
has_ac                   29.64747     17.603961
has_lift                 43.77123    167.115232
has_pool                 21.63180     17.919189
has_terrace              27.35230      8.230963
has_balcony              16.46964      7.268777
has_storage_room         28.08418     12.307732
is_accessible            19.85101      7.658942
energy_certificate       18.78154     31.565577
has_parking              37.55766     40.256615
price_category           34.45941    577.508998
Code
# Rent Price Log
rf_rent_importance <- importance(rf_rent_results$model)
print("Random Forest Feature Importance - Rent Price Log:")
[1] "Random Forest Feature Importance - Rent Price Log:"
Code
print(rf_rent_importance)
                           %IncMSE IncNodePurity
sq_mt_built              57.197217    252.988058
n_rooms                  23.459625     36.126572
n_bathrooms              37.705395    104.385156
is_exact_address_hidden   4.703363      4.279337
floor                    26.183249     23.880334
neighborhood_id         121.424102    313.909393
house_type_id             9.750228      6.807386
is_renewal_needed         9.413943      4.677306
built_year               30.325131     46.857102
has_ac                   14.917694      8.128103
has_lift                 45.699468     81.767352
has_pool                 15.841912     10.207096
has_terrace              10.186054      4.973889
has_balcony               8.059480      3.993677
has_storage_room         13.403766      6.591565
is_accessible            12.170886      5.719043
energy_certificate        3.204401     20.368449
has_parking              27.759988     20.710221
gmm_cluster              22.325361    246.941849
price_category           21.949697    229.352544

4.4.4 Comparative Insights into Coefficients and Feature Importance

4.4.4.1 Consistent Patterns

  • Across all models, neighborhoods like Salamanca, Chamberí, and Centro positively influence both buy and rent prices, though their relative impacts differ. Conversely, Villaverde consistently exhibits the most significant negative influence in all cases.
  • Features such as lift presence and parking availability emerge as significant drivers for both price types, underscoring the importance of these amenities.

4.4.4.2 Divergent Patterns

  • Certain neighborhoods, while positive in both models, exhibit higher impacts on one price type. For instance, Chamberí strongly influences rent prices, whereas Salamanca leads for buy prices.
  • Negative influences from specific neighborhoods show greater magnitude for buy prices than rent prices, with differences in the specific neighborhoods emphasized in each model.

4.4.4.3 Interpretation Random Forest

The Random Forest analysis supports and extends the coefficient findings by emphasizing the interaction effects of neighborhood and property features. These results highlight the dominant role of location and critical property characteristics like size, lift presence, and parking in predicting both buy and rent prices.

5 Conclusion

5.1 Summary

We conducted an extensive analysis of the Madrid housing dataset, including data cleaning and an examination of key factors such as property type, size, number of rooms, energy certificate, year of construction, and location. Visualizations revealed trends in property size, room count, neighborhood influence, and amenities like lifts and parking, all of which significantly impact prices. The core insights show that location has the highest impact on both buying and rental prices, followed by the property having amenities such as a lift and a pool. Other features such as for example the size and type of housing also have a significant influence on the price. This answers our research questions that not only the primary more obvious factors like the size, but also the less intuitive aspects have a significant impact that cannot be ignored. Specifically looking at the trends of location, the analysis shows us that central and northern districts positively influence the prices, while southern areas decrease prices. Predictive modeling identified Random Forest as the most effective tool for capturing price determinants and forecasting trends. These findings provide valuable insights into Madrid’s housing market and support data-driven decision-making for buyers, renters, and investors.

5.2 Final Thoughts

A key challenge in this project has been managing missing values and potential biases within the dataset. For certain variables, particularly boolean ones, we made assumptions to handle missing values. Additionally, the dataset’s static nature, representing a single snapshot in time, limits our ability to capture seasonal trends and evolving market dynamics in Madrid.

While we cannot expand beyond the dataset’s scope, we can highlight specific limitations, such as the lack of time-series data or additional contextual information, that affect our ability to fully capture market trends. These points will help guide recommendations for enhancing similar analyses in the future and achieving a more complete view of Madrid’s housing market.

5.3 References